Analysis of White Wine Quality by Bantwale D. Enyew

Introduction

The dataset used in this analysis was created by Paulo Cortez (University of Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (2009) and is publicly available for research. It covers the physicochemical variables that affect the quality of Portuguese “Vinho Verde” white wine. A fuller description of the dataset is given in the first entry of the Reference section.

Load the Data

## [1] "/home/banito/Downloads/project4"
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

The dataset consists of 4898 observations of 12 variables (plus the row index X) and has no missing values.
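As a minimal sketch of how the structure and missing-value check could be reproduced, the snippet below reads three rows copied from the output above. The file name in the comment and the object names `wine_sample` and `wine_vars` are assumptions for illustration, not the names used in the original analysis.

```r
# Three rows copied from the head() output above; a real run would read the
# full CSV instead, e.g. read.csv("wineQualityWhites.csv") (file name assumed).
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.0,0.27,0.36,20.7,0.045,45,170,1.0010,3.00,0.45,8.8,6
2,6.3,0.30,0.34,1.6,0.049,14,132,0.9940,3.30,0.49,9.5,6
3,8.1,0.28,0.40,6.9,0.050,30,97,0.9951,3.26,0.44,10.1,6"
wine_sample <- read.csv(text = csv_text)

# X is only a row index, so drop it before counting variables.
wine_vars <- wine_sample[, names(wine_sample) != "X"]

n_obs <- nrow(wine_vars)         # number of observations in this sample
n_vars <- ncol(wine_vars)        # 11 input variables plus quality
has_missing <- anyNA(wine_vars)  # TRUE if any cell is NA

print(c(observations = n_obs, variables = n_vars))
print(has_missing)
```

On the full dataset the same calls would be expected to report 4898 observations and no missing values, as stated above.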

Univariate Plots Section

In this section we gain some insight into the dataset by plotting each variable and examining its distribution. Fixed acidity (fixed.acidity) is the amount of non-volatile acid found in the wine; it ranges from a minimum of 3.8 to a maximum of 14.2 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 45 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

The volatile.acidity variable indicates the amount of acetic acid in the wine; it ranges from 0.08 to 1.1 g/dm^3. The histogram of volatile acidity across all observations shows that the amount of acetic acid lies mostly between 0.1 and 0.8.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing missing values (geom_bar).

Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines. The amount of citric acid ranges from 0 to 1.66 g/dm^3. The histogram shows a more or less normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 22 rows containing non-finite values (stat_bin).

Residual sugar is the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. The minimum and maximum residual sugar found in the dataset are 0.6 and 65.8 respectively. The residual.sugar variable has a long-tailed distribution that is skewed to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 240 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Chlorides indicates the amount of salt (sodium chloride, g/dm^3) present in the wine. In the dataset the value ranges from 0.009 to 0.346. The histogram for chlorides shows an approximately normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 237 rows containing non-finite values (stat_bin).

Free sulfur dioxide is the free form of SO2 (in mg/dm^3) that exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. Its value ranges from 2 to 289. The histogram indicates an approximately normal distribution for free sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 237 rows containing non-finite values (stat_bin).

Total sulfur dioxide is the sum of the free and bound forms of SO2. In low concentrations SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm it becomes evident in the nose and taste of the wine. The summary for the variable shows a minimum of 9 and a maximum of 440 mg/dm^3. The histogram is close to a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 6 rows containing non-finite values (stat_bin).

The density of the wine ranges between 0.9871 and 1.039 g/cm^3. The histogram shows an approximately normal distribution for the density variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

The pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. The pH values in the dataset range from 2.72 to 3.82. The histogram shows an approximately normal distribution for pH.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing missing values (geom_bar).

Sulphates is a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. The amount of sulphates in the wine ranges from 0.22 to 1.08. The histogram shows a close to normal distribution of sulphate content in the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 229 rows containing non-finite values (stat_bin).

The alcohol content of the wine (% by volume) is the last input variable; it varies from 8.0 to 14.2. The histogram shows a distribution skewed to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing missing values (geom_bar).

The last variable in the dataset is quality, the output variable, which is affected by the combination of the input variables. Its values in the dataset range from 3 to 9, with a median of 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing missing values (geom_bar).

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4898 observations of 12 variables: 11 input variables based on physicochemical tests, which affect wine quality, and 1 output variable, which is the quality of the wine. The input variables are fixed acidity (tartaric acid - g / dm^3), volatile acidity (acetic acid - g / dm^3), citric acid (g / dm^3), residual sugar (g / dm^3), chlorides (sodium chloride - g / dm^3), free sulfur dioxide (mg / dm^3), total sulfur dioxide (mg / dm^3), density (g / cm^3), pH, sulphates (potassium sulphate - g / dm^3) and alcohol (% by volume). The output variable, based on sensory data, is the quality score between 0 and 10. At least 3 wine experts rated the quality of each wine, with 0 as the lowest rating and 10 as the highest rating.

What is/are the main feature(s) of interest in your dataset?

The quality of the white wine is what ultimately matters most; the combination of all the other inputs produces a certain kind of taste, which determines the quality of the wine. Most wines are concentrated in quality categories 5, 6 and 7; a small number of white wines fall in categories 3, 4, 8 and 9, and none fall in categories 0, 1, 2 or 10.
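A count per quality category, including the categories that never occur, could be obtained along these lines. This is only a sketch on made-up scores; the vector `quality` below is illustrative and is not the wine data.

```r
# Made-up quality scores for illustration only; not the wine data.
quality <- c(6, 6, 5, 7, 5, 6, 8, 4, 6, 5)

# Fixing the levels to 0..10 keeps unused scores in the table with a count of 0.
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Scores that never occur in the sample.
unused <- names(quality_counts)[quality_counts == 0]
print(unused)
```

Declaring the full 0 to 10 scale as factor levels is what makes the empty categories visible; a plain `table(quality)` would list only the scores that appear.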

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

In the univariate plots section, I plotted a histogram for every variable in the dataset and studied its distribution; most of the variables are close to normally distributed. Alcohol and residual sugar show distributions skewed to the right. Volatile acidity and citric acid show some irregularities.

Did you create any new variables from existing variables in the dataset?

I didn’t create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When plotting histograms of the variables, I used xlim together with the 99th or 95th percentile (from the quantile function) to cap the upper x-axis value. This removes outliers from the plot and gives a clearer view of the distribution of each variable.
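The percentile cap can be sketched as follows. The original plots appear to have been made with ggplot2; this version uses base R's `hist` instead so that it runs without extra packages, and the data are synthetic (the vector `sugar` is a stand-in, not the real residual.sugar column).

```r
set.seed(42)
# Synthetic right-skewed values standing in for a variable such as residual.sugar.
sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Upper axis limit at the 99th percentile.
upper <- quantile(sugar, probs = 0.99, names = FALSE)

# Values above the cutoff are the ones a capped axis leaves out of the plot.
n_dropped <- sum(sugar > upper)

# Use a null device when run as a script so that no plot file is written.
if (!interactive()) pdf(NULL)
hist(sugar[sugar <= upper], breaks = 30, xlim = c(0, upper),
     main = "Synthetic residual sugar (up to the 99th percentile)",
     xlab = "residual sugar (g/dm^3)")

print(upper)
print(n_dropped)
```

In ggplot2, values outside the axis limits are dropped before binning, which is presumably what the "Removed ... rows containing non-finite values (stat_bin)" warnings in the output above refer to.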

Bivariate Plots Section

The correlation coefficient between alcohol and the quality of white wine is 0.4355747, which shows a positive correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
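Output of this kind comes from a Pearson correlation test. The sketch below runs `cor.test` on synthetic data (the vectors `alcohol` and `quality` are simulated, not the wine data) and shows how the reported t statistic follows from the correlation coefficient and the degrees of freedom.

```r
set.seed(1)
# Synthetic stand-ins for alcohol and quality; not the wine data.
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
quality <- 3 + 0.3 * alcohol + rnorm(n, sd = 0.8)

res <- cor.test(quality, alcohol, method = "pearson")
r <- unname(res$estimate)

# The t statistic follows from r and the sample size:
# t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
print(c(r = r, t_from_cor_test = unname(res$statistic), t_manual = t_manual))

# The same identity applied to the values reported above (r = 0.4355747, df = 4896).
t_reported <- 0.4355747 * sqrt(4896) / sqrt(1 - 0.4355747^2)
print(t_reported)  # close to the reported t = 33.858
```

With 4896 degrees of freedom even a small coefficient gives a large t value, so the p-value says more about the sample size than about the strength of the relationship.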

The correlation coefficient between quality and fixed.acidity is -0.1136628, which shows a weak negative correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$fixed.acidity
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14121974 -0.08592991
## sample estimates:
##        cor 
## -0.1136628

There is very weak or no correlation between citric.acid and quality: the correlation coefficient is -0.009209091 and is not statistically significant (p-value = 0.5193).

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$citric.acid
## t = -0.6444, df = 4896, p-value = 0.5193
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03720595  0.01880221
## sample estimates:
##          cor 
## -0.009209091

A weak negative correlation exists between quality and residual.sugar, with a correlation coefficient of -0.09757683.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683

There is a negative correlation between quality and chlorides, with a correlation coefficient of -0.2099344.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344

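There is a very weak positive correlation between free.sulfur.dioxide and quality (correlation coefficient of 0.008158067), which is not statistically significant (p-value = 0.5681).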

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$free.sulfur.dioxide
## t = 0.57085, df = 4896, p-value = 0.5681
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01985292  0.03615626
## sample estimates:
##         cor 
## 0.008158067

There is a negative correlation between total.sulfur.dioxide and quality, with a correlation coefficient of -0.1747372.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2017563 -0.1474524
## sample estimates:
##        cor 
## -0.1747372

The correlation between density and quality is also negative, with a correlation coefficient of -0.3071233.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233

There is a weak positive correlation between quality and the pH of white wine, with a correlation coefficient of 0.09942725.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725

A weak positive correlation exists between quality and sulphates in white wine; the correlation coefficient is 0.05367788.

## 
##  Pearson's product-moment correlation
## 
## data:  wht_wine_quality$quality and wht_wine_quality$sulphates
## t = 3.7613, df = 4896, p-value = 0.000171
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02571007 0.08156172
## sample estimates:
##        cor 
## 0.05367788

overlaying summary with raw data

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In the plots above, I examined the correlation and distribution of each input variable against the output variable (the quality of the white wine). Among the input variables, alcohol, pH, sulphates and free.sulfur.dioxide are positively correlated with quality. Alcohol has the strongest correlation (correlation coefficient of 0.4355747): the higher the alcohol percentage of the white wine, the better its quality tends to be. The other physicochemical variables examined are negatively correlated with quality; among them, citric.acid has the weakest correlation (correlation coefficient of -0.009209091).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Apart from the positively correlated input variables, the other variables have a weak negative correlation with the quality of the white wine.

What was the strongest relationship you found?

Among all the input variables, the quality of the white wine is most strongly correlated with the alcohol level of the wine.

Multivariate Plots Section

Let us plot the key variables against each other using the ‘ggpairs’ function and build a simple multiple linear regression model.

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density + 
##     fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.502e+02  1.880e+01   7.987 1.71e-15 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16
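A summary like the one above is produced by fitting a model with `lm` and calling `summary` on the result. The sketch below does this on synthetic data with only two predictors; the data frame `sim` and its coefficients are invented for illustration and do not reproduce the wine model.

```r
set.seed(7)
# Synthetic data with a known linear relationship; not the wine data.
n <- 500
sim <- data.frame(
  alcohol = rnorm(n, mean = 10.5, sd = 1.2),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.1)
)
sim$quality <- 4 + 0.3 * sim$alcohol - 2 * sim$volatile.acidity +
  rnorm(n, sd = 0.5)

fit <- lm(quality ~ alcohol + volatile.acidity, data = sim)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value per term.
print(coef(fit_summary))
# Share of the variance in quality explained by the model.
print(fit_summary$r.squared)
```

Read the same way, the multiple R-squared of 0.2819 reported above means the fitted model accounts for roughly 28% of the variance in quality.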

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Using the ‘ggpairs’ function in GGally, I plotted a chart that gives a quick summary of the different variables, including scatter plots, box plots and correlation coefficients.

Were there any interesting or surprising interactions between features?

The ‘ggpairs’ chart, which shows the correlations among the variables, indicates that most pairs of variables are not strongly correlated. The exception is a strong correlation between density and total.sulfur.dioxide (0.839); the other correlation coefficients are below 0.5 in absolute terms.
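The same screening can be done numerically from a correlation matrix instead of reading values off the chart. This is a sketch on synthetic columns (`x1`, `x2` and `x3` are invented names); the 0.5 threshold mirrors the one used in the text.

```r
set.seed(3)
# Synthetic columns; x2 is built to track x1 closely, x3 is independent.
n <- 300
x1 <- rnorm(n)
sim <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),
  x3 = rnorm(n)
)

cm <- cor(sim)  # Pearson correlation matrix

# Keep each pair once (upper triangle) and flag |r| above 0.5.
strong <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[strong[, "row"]],
  var2 = colnames(cm)[strong[, "col"]],
  r = cm[strong]
)
print(strong_pairs)
```

Applied to the input columns of the wine data, a table like `strong_pairs` would list every pair whose coefficient exceeds the threshold, which makes the claim about the remaining pairs easy to check.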

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I built a multiple linear regression model. A limitation of the model is that the data was collected for only one wine variety and only one manufacturer, so it may not represent how physicochemical input variables relate to quality for other manufacturers and wine varieties in general.

Final Plots and Summary

Plot One

## Warning: Removed 5 rows containing non-finite values (stat_bin).

Description One

The histogram of white wine quality across all observations in the dataset shows that most white wines fall in quality levels 6, 5 and 7. Only a few observations fall in quality levels 8, 4, 3 and 9. None of the observations falls in the highest quality level (10) or in the lowest quality levels (0, 1 and 2).

Plot Two

Description Two

Above we computed the correlation between the quality of the white wine and the input variables that affect it. Alcohol level showed the strongest correlation of the variables examined. This plot shows a box plot of alcohol level for each quality level, together with a linear fit line. The box plots for quality levels 5, 6 and 7 show many outliers lying below the 25th and above the 75th percentile. The linear fit line reflects the positive correlation between quality and the alcohol level of the white wine.

Plot Three

Description Three

The plot shows density plotted against alcohol level, grouped by the quality of the white wine. It shows that white wines of medium quality (6) tend to have low alcohol and high density, while white wines of higher quality (8 and 9) tend to have low density and a higher alcohol percentage. The linear model fit between density and alcohol also shows a negative correlation between them for each quality level.

Reflection

In summary, the analysis above has shown that the quality of the white wine depends on the physicochemical input variables. Quality is positively correlated with alcohol, free.sulfur.dioxide, sulphates and pH, and negatively correlated with the other physicochemical input variables. The correlations among the variables are not that strong: all correlation coefficients are below 0.5 in absolute terms, with the exception of a strong correlation between density and total.sulfur.dioxide, which is above 0.8.

Reference

  1. https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

  2. https://onlinecourses.science.psu.edu/stat857/node/224
  3. http://ggobi.github.io/ggally/#generic_example
  4. http://ggobi.github.io/ggally/rd.html#ggpairs_alias